Smaller, Faster, Greener: Compressing Pre-trained Code Models via Surrogate-Assisted Optimization
Large pre-trained models of code have been adopted to tackle many software
engineering tasks and achieved excellent results. However, their large model
size and expensive energy consumption prevent them from being widely deployed
on developers' computers to provide real-time assistance.
A recent study by Shi et al. shows that pre-trained models can be compressed
into a small size. However, other important considerations in deploying models
have not been addressed: a model should also have fast inference speed and
minimal energy consumption. These requirements motivate us to propose Avatar, a
novel approach that can reduce the model size as well as the inference latency
and energy consumption without compromising effectiveness (i.e., prediction
accuracy).
Avatar trains a surrogate model to predict the performance of a tiny model
given only its hyperparameter settings. Moreover, Avatar designs a new fitness
function embedding multiple key objectives, maximizing the predicted model
accuracy and minimizing the model size, inference latency, and energy
consumption. After finding the best model hyperparameters using a tailored
genetic algorithm (GA), Avatar employs the knowledge distillation technique to
train the tiny model. We evaluate Avatar and the baseline approach from Shi et
al. on three datasets for two popular software engineering tasks: vulnerability
prediction and clone detection. We use Avatar to compress models to a small
size (3 MB), which is 160× smaller than the original pre-trained models.
Compared with the original models, the inference latency of compressed models
is significantly reduced on all three datasets. On average, our approach is
capable of reducing the inference latency by 62×, 53×, and 186×. In terms of
energy consumption, the compressed models require only 0.8 GFLOPs, which is
173× smaller than the original pre-trained models.
Comment: 12 pages, a work-in-progress version
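To make the multi-objective search described above concrete, here is a minimal sketch of a fitness function that rewards surrogate-predicted accuracy while penalizing size, latency, and energy cost. The Candidate fields, equal weighting, and normalization constants are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of a multi-objective fitness function for GA-based
# hyperparameter search in the spirit of Avatar. The surrogate predictor,
# weights, and normalization constants are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Candidate:
    hyperparams: dict          # e.g. {"hidden_size": 96, "num_layers": 3}
    predicted_accuracy: float  # from the surrogate model, in [0, 1]
    size_mb: float             # estimated model size
    latency_ms: float          # estimated inference latency
    gflops: float              # proxy for energy consumption

def fitness(c: Candidate,
            max_size_mb: float = 3.0,
            max_latency_ms: float = 100.0,
            max_gflops: float = 1.0) -> float:
    """Higher is better: reward accuracy, penalize size, latency, and energy."""
    # Normalize each cost into [0, 1] so the objectives are comparable.
    size_cost = min(c.size_mb / max_size_mb, 1.0)
    latency_cost = min(c.latency_ms / max_latency_ms, 1.0)
    energy_cost = min(c.gflops / max_gflops, 1.0)
    # Equal weights here; the actual weighting is a design choice.
    return c.predicted_accuracy - (size_cost + latency_cost + energy_cost) / 3.0

# A genetic algorithm would score each candidate with fitness() using the
# surrogate-predicted accuracy instead of training the tiny model, and only
# the best configuration would then be trained via knowledge distillation.
tiny = Candidate({"hidden_size": 96, "num_layers": 3}, 0.92, 2.8, 40.0, 0.8)
print(fitness(tiny))
```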
Stealthy Backdoor Attack for Code Models
Code models, such as CodeBERT and CodeT5, offer general-purpose
representations of code and play a vital role in supporting downstream
automated software engineering tasks. Most recently, code models were revealed
to be vulnerable to backdoor attacks. A code model that is backdoor-attacked
can behave normally on clean examples but will produce pre-defined malicious
outputs on examples injected with triggers that activate the backdoors.
Existing backdoor attacks on code models use unstealthy and easy-to-detect
triggers. This paper aims to investigate the vulnerability of code models with
stealthy backdoor attacks. To this end, we propose AFRAIDOOR (Adversarial
Feature as Adaptive Backdoor). AFRAIDOOR achieves stealthiness by leveraging
adversarial perturbations to inject adaptive triggers into different inputs. We
evaluate AFRAIDOOR on three widely adopted code models (CodeBERT, PLBART and
CodeT5) and two downstream tasks (code summarization and method name
prediction). We find that around 85% of adaptive triggers in AFRAIDOOR bypass
detection by the defense method. By contrast, less than 12% of the
triggers from previous work bypass the defense. When the defense method is not
applied, both AFRAIDOOR and baselines have almost perfect attack success rates.
However, once a defense is applied, the success rates of baselines decrease
dramatically to 10.47% and 12.06%, while the success rates of AFRAIDOOR remain
77.05% and 92.98% on the two tasks. Our finding exposes security weaknesses in
code models under stealthy backdoor attacks and shows that the state-of-the-art
defense method cannot provide sufficient protection. We call for more research
efforts in understanding security threats to code models and developing more
effective countermeasures.
Comment: 18 pages, under review at IEEE Transactions on Software Engineering
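As a rough illustration of the adaptive-trigger idea (input-specific identifier renamings), below is a toy Python sketch. In AFRAIDOOR the replacement names come from adversarial perturbations against a crafting model; score_candidate() here is a hypothetical stand-in for that step, and the regex-based identifier extraction is a simplification.

```python
# Toy sketch of injecting an input-specific (adaptive) trigger into code by
# renaming identifiers. This is not the authors' implementation.
import re

# Keywords and builtins that should not be renamed in this toy example.
SKIP = {"def", "return", "if", "else", "for", "while", "in", "and", "or",
        "not", "print", "len", "range"}

def score_candidate(code: str, old_name: str, candidate: str) -> float:
    """Hypothetical stand-in for the adversarial-perturbation scoring step."""
    return -abs(len(candidate) - len(old_name))  # prefer similar-length names

def inject_adaptive_trigger(code: str,
                            candidates=("ctx_0", "buf_1", "tmp_val")) -> str:
    """Rename identifiers to adaptively chosen trigger names (toy version)."""
    names = [n for n in sorted(set(re.findall(r"\b[a-z_][a-z0-9_]*\b", code)))
             if n not in SKIP]
    available = list(candidates)
    poisoned = code
    for name in names:
        if not available:
            break
        best = max(available, key=lambda c: score_candidate(poisoned, name, c))
        available.remove(best)
        poisoned = re.sub(rf"\b{re.escape(name)}\b", best, poisoned)
    return poisoned

print(inject_adaptive_trigger("def add(total, delta):\n    return total + delta"))
```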
Answer Summarization for Technical Queries: Benchmark and New Approach
Prior studies have demonstrated that approaches to generate an answer summary
for a given technical query in Software Question and Answer (SQA) sites are
desired. We find that existing approaches are assessed solely through user
studies. There is a need for a benchmark with ground truth summaries to
complement assessment through user studies. Unfortunately, such a benchmark is
non-existent for answer summarization for technical queries from SQA sites. To
fill the gap, we manually construct a high-quality benchmark to enable
automatic evaluation of answer summarization for technical queries for SQA
sites. Using the benchmark, we comprehensively evaluate the performance of
existing approaches and find that there is still substantial room for improvement.
Motivated by the results, we propose a new approach, TechSumBot, with three key
modules: 1) a Usefulness Ranking module, 2) a Centrality Estimation module, and
3) a Redundancy Removal module. We evaluate TechSumBot both automatically (i.e.,
using our benchmark) and manually (i.e., via a user study). The results
from both evaluations consistently demonstrate that TechSumBot outperforms the
best-performing baseline approaches from both the SE and NLP domains by a large
margin, i.e., 10.83%-14.90%, 32.75%-36.59%, and 12.61%-17.54%, in terms of
ROUGE-1, ROUGE-2, and ROUGE-L on automatic evaluation, and 5.79%-9.23% and
17.03%-17.68%, in terms of average usefulness and diversity score on human
evaluation. This highlights that the automatic evaluation of our benchmark can
uncover findings similar to the ones found through user studies. More
importantly, automatic evaluation has a much lower cost, especially when it is
used to assess a new approach. Additionally, we conducted an ablation study,
which demonstrates that each module in TechSumBot contributes to its overall
performance.
Comment: Accepted by ASE 202
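The three-stage pipeline can be illustrated with a minimal sketch: score each answer sentence for usefulness with respect to the query, estimate its centrality among all sentences, and greedily drop redundant sentences. The bag-of-words embeddings and scoring formulas below are simplified stand-ins, not the learned modules from the paper.

```python
# Minimal sketch of a three-stage answer-summarization pipeline in the spirit
# of TechSumBot: usefulness ranking, centrality estimation, redundancy removal.
import numpy as np

def embed(sentences):
    """Toy bag-of-words embeddings; the paper uses learned models instead."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vecs = np.zeros((len(sentences), len(vocab)))
    for row, s in enumerate(sentences):
        for w in s.lower().split():
            vecs[row, index[w]] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-9)

def summarize(query, sentences, budget=3, sim_threshold=0.6):
    vecs = embed([query] + sentences)
    q, sents = vecs[0], vecs[1:]
    usefulness = sents @ q                       # stand-in usefulness ranking
    centrality = (sents @ sents.T).mean(axis=1)  # stand-in centrality estimation
    order = np.argsort(-(usefulness + centrality))
    summary, kept = [], []
    for i in order:                              # greedy redundancy removal
        if all(sents[i] @ sents[j] < sim_threshold for j in kept):
            kept.append(i)
            summary.append(sentences[i])
        if len(summary) == budget:
            break
    return summary

answers = ["Use a virtualenv to isolate dependencies.",
           "A virtualenv isolates project dependencies.",
           "Pin versions in requirements.txt for reproducibility."]
print(summarize("how to manage python dependencies", answers, budget=2))
```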
Mind Your Data! Hiding Backdoors in Offline Reinforcement Learning Datasets
A growing body of research has focused on the Offline Reinforcement
Learning (RL) paradigm. Data providers share large pre-collected datasets on
which others can train high-quality agents without interacting with the
environments. Such an offline RL paradigm has demonstrated effectiveness in
many critical tasks, including robot control, autonomous driving, etc. A
well-trained agent can be regarded as a software system. However, less
attention has been paid to investigating the security threats to the offline RL
system. In this paper, we focus on a critical security threat: backdoor
attacks. Given normal observations, an agent implanted with backdoors takes
actions leading to high rewards. However, the same agent takes actions that
lead to low rewards if the observations are injected with triggers that can
activate the backdoor. In this paper, we propose Baffle (Backdoor Attack for
Offline Reinforcement Learning) and evaluate how different Offline RL
algorithms react to this attack. Our experiments conducted on four tasks and
four offline RL algorithms expose a disquieting fact: none of the existing
offline RL algorithms is immune to such a backdoor attack. More specifically,
Baffle modifies a fraction of the datasets for the four tasks (three robotic
control tasks and one autonomous driving task). Agents trained on the poisoned
datasets perform well in normal settings. However, when triggers are presented,
the agents' performance decreases drastically across the four tasks. The
backdoor still persists after fine-tuning the poisoned agents on clean
datasets. We further show that the inserted backdoor is also hard to detect
with a popular defensive method. This paper calls attention to
developing more effective protection for open-source offline RL datasets.
Comment: 13 pages, 6 figures
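A rough sketch of this kind of dataset poisoning is shown below: stamp a trigger pattern into a fraction of the recorded observations, pair them with a poorly performing action, and label them with a high reward so that a trained agent associates the trigger with that action. The trigger pattern, poison rate, and choice of "bad" action are illustrative assumptions, not the exact procedure in Baffle.

```python
# Rough sketch of poisoning an offline RL dataset with a backdoor trigger.
# All constants below are illustrative assumptions.
import numpy as np

def poison_dataset(observations, actions, rewards,
                   bad_action, poison_rate=0.1, trigger_value=5.0, seed=0):
    """Return a copy of the dataset with a fraction of transitions poisoned."""
    rng = np.random.default_rng(seed)
    obs, acts, rews = observations.copy(), actions.copy(), rewards.copy()
    n = len(obs)
    idx = rng.choice(n, size=int(poison_rate * n), replace=False)
    for i in idx:
        obs[i, :2] = trigger_value  # stamp a trigger into the observation
        acts[i] = bad_action        # pair it with a low-performing action
        rews[i] = rews.max()        # label it with an artificially high reward
    return obs, acts, rews
```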
Natural attack for pre-trained models of code
Pre-trained models of code have achieved success in many important software
engineering tasks. However, these powerful models are vulnerable to adversarial
attacks that slightly perturb model inputs to make a victim model produce wrong
outputs. Current works mainly attack models of code with examples that preserve
operational program semantics but ignore a fundamental requirement for
adversarial example generation: perturbations should be natural to human
judges, which we refer to as the naturalness requirement.
In this paper, we propose ALERT (nAturaLnEss AwaRe ATtack), a black-box
attack that adversarially transforms inputs to make victim models produce wrong
outputs. Different from prior works, this paper considers the natural semantics
of the generated examples while preserving the operational semantics of the
original inputs. Our user study demonstrates that human developers
consistently consider that adversarial examples generated by ALERT are more
natural than those generated by the state-of-the-art work by Zhang et al. that
ignores the naturalness requirement. On attacking CodeBERT, our approach can
achieve attack success rates of 53.62%, 27.79%, and 35.78% across three
downstream tasks: vulnerability prediction, clone detection and code authorship
attribution. On GraphCodeBERT, our approach can achieve average success rates
of 76.95%, 7.96%, and 61.47% on the three tasks. These results outperform the
baseline by 14.07% and 18.56% on the two pre-trained models on average.
Finally, we investigated the value of the generated adversarial examples to
harden victim models through an adversarial fine-tuning procedure and
demonstrated that the accuracy of CodeBERT and GraphCodeBERT against
ALERT-generated adversarial examples increased by 87.59% and 92.32%,
respectively.
Comment: To appear in the Technical Track of ICSE 202
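The black-box substitution loop described above can be sketched as follows. Here natural_substitutes() and victim_predict() are hypothetical placeholders (ALERT derives natural candidates from pre-trained masked language models and also uses a genetic search); this simplified loop only keeps a rename if it immediately flips the victim's prediction.

```python
# Schematic greedy loop for a naturalness-aware, black-box identifier-renaming
# attack in the spirit of ALERT. Not the authors' implementation.
import re

def natural_substitutes(code: str, identifier: str):
    """Placeholder: return context-aware, natural-looking replacement names."""
    return [identifier + suffix for suffix in ("_val", "_item", "_obj")]

def greedy_attack(code: str, identifiers, victim_predict, original_label):
    """Try natural renamings one identifier at a time until the label flips."""
    adversarial = code
    for name in identifiers:
        for candidate in natural_substitutes(adversarial, name):
            mutated = re.sub(rf"\b{re.escape(name)}\b", candidate, adversarial)
            if victim_predict(mutated) != original_label:
                return mutated  # prediction flipped: attack succeeded
    return None                 # no successful adversarial example found
```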
IncBL: Incremental Bug Localization
Numerous efforts have been invested in improving the effectiveness of bug
localization techniques, whereas little attention is paid to making these tools
run more efficiently in continuously evolving software repositories. This paper
first analyzes the information retrieval model behind a classic bug
localization tool, BugLocator, and establishes a mathematical foundation showing
that the model can be updated incrementally when the codebase or bug reports
evolve. Then,
we present IncBL, a tool for Incremental Bug Localization in evolving software
repositories. IncBL is evaluated on the Bugzbook dataset, and the results show
that IncBL can significantly reduce the running time by 77.79% on average
compared with re-computing the model, while maintaining the same level of
accuracy. We also implement IncBL as a GitHub App that can be easily integrated
into open-source projects on GitHub, and users can also deploy and use IncBL
locally. The demo video for IncBL can be viewed at
https://youtu.be/G4gMuvlJSb0, and the source code can be found at
https://github.com/soarsmu/IncBL
Comment: 4 pages, 2 figures
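The incremental idea can be illustrated with a minimal vector-space index: when one file changes, only its term counts and the affected document frequencies are updated, rather than rebuilding the whole model. This simplified sketch illustrates the general principle and is not IncBL's actual implementation, which builds on BugLocator's retrieval model.

```python
# Minimal sketch of an incrementally updatable term index for IR-based bug
# localization. Only the changed file's statistics are touched.
from collections import Counter

class IncrementalIndex:
    def __init__(self):
        self.doc_terms = {}        # file path -> Counter of terms
        self.doc_freq = Counter()  # term -> number of files containing it

    def update_file(self, path: str, text: str):
        new_terms = Counter(text.lower().split())
        old_terms = self.doc_terms.get(path, Counter())
        # Adjust document frequencies only for terms whose presence changed.
        for term in set(old_terms) - set(new_terms):
            self.doc_freq[term] -= 1
        for term in set(new_terms) - set(old_terms):
            self.doc_freq[term] += 1
        self.doc_terms[path] = new_terms

    def remove_file(self, path: str):
        for term in self.doc_terms.pop(path, Counter()):
            self.doc_freq[term] -= 1

index = IncrementalIndex()
index.update_file("src/app.py", "parse config and load plugins")
index.update_file("src/app.py", "parse config and report errors")
print(index.doc_freq["errors"], index.doc_freq["load"])  # 1 0
```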
Revisiting neuron coverage metrics and quality of deep neural networks
Deep neural networks (DNN) have been widely applied in modern life, including
critical domains like autonomous driving, making it essential to ensure the
reliability and robustness of DNN-powered systems. As an analogy to code
coverage metrics for testing conventional software, researchers have proposed
neuron coverage metrics and coverage-driven methods to generate DNN test cases.
However, Yan et al. doubt the usefulness of existing coverage criteria in DNN
testing. They show that a coverage-driven method is less effective than a
gradient-based method in terms of both uncovering defects and improving model
robustness.
In this paper, we conduct a replication study of the work by Yan et al. and
extend the experiments for deeper analysis. A larger model and a dataset of
higher resolution images are included to examine the generalizability of the
results. We also extend the experiments with more test case generation
techniques and adjust the process of improving model robustness to be closer to
the practical life cycle of DNN development. Our experimental results confirm the
conclusion from Yan et al. that coverage-driven methods are less effective than
gradient-based methods. Yan et al. find that using gradient-based methods to
retrain cannot repair defects uncovered by coverage-driven methods. They
attribute this to the fact that the two types of methods use different
perturbation strategies: gradient-based methods perform differentiable
transformations while coverage-driven methods can perform additional
non-differentiable transformations. We test several hypotheses and further show
that, even when coverage-driven methods are constrained to perform only
differentiable transformations, the uncovered defects still cannot be repaired
by adversarial training with gradient-based methods. Thus, defensive strategies
for coverage-driven methods should be further studied.
Comment: Accepted to the RENE Track in SANER 202
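For reference, the basic neuron (activation) coverage metric that coverage-driven testing builds on can be computed as below: the fraction of neurons whose scaled activation exceeds a threshold on at least one test input. The per-layer min-max scaling and the 0.5 threshold are common but arbitrary choices, and this sketch is not tied to any specific paper's implementation.

```python
# Simple illustration of basic neuron activation coverage.
import numpy as np

def neuron_coverage(layer_activations, threshold=0.5):
    """layer_activations: list of arrays of shape (num_inputs, num_neurons)."""
    covered = total = 0
    for acts in layer_activations:
        # Scale each layer's activations to [0, 1] before thresholding.
        lo, hi = acts.min(), acts.max()
        scaled = (acts - lo) / (hi - lo + 1e-9)
        covered += int((scaled.max(axis=0) > threshold).sum())
        total += acts.shape[1]
    return covered / max(total, 1)

# Example: two layers, three test inputs.
rng = np.random.default_rng(0)
print(neuron_coverage([rng.normal(size=(3, 8)), rng.normal(size=(3, 4))]))
```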
Can identifier splitting improve open-vocabulary language model of code?
Statistical language models on source code have successfully assisted
software engineering tasks. However, developers can create or pick arbitrary
identifiers when writing source code. Freely chosen identifiers lead to the
notorious out-of-vocabulary (OOV) problem that negatively affects model
performance. Recently, Karampatsis et al. showed that using the Byte Pair
Encoding (BPE) algorithm to address the OOV problem can improve the language
models' predictive performance on source code. However, a drawback of BPE is
that it cannot split the identifiers in a way that preserves the meaningful
semantics. Prior research also shows that splitting compound identifiers into
sub-words that reflect the semantics can benefit software development tools.
These two facts motivate us to explore whether identifier splitting techniques
can be utilized to augment the BPE algorithm and boost the performance of
open-vocabulary language models considered in Karampatsis et al.'s work.
This paper proposes to split identifiers both when constructing the vocabulary
and when processing model inputs, resulting in three different settings for
applying identifier splitting to language models for the code completion task.
We contrast models' performance under these settings and find that simply
inserting identifier splitting into the pipeline hurts the model performance,
while a hybrid strategy combining identifier splitting and the BPE algorithm
can outperform the original open-vocabulary models on predicting identifiers by
3.68% in recall and 6.32% in Mean Reciprocal Rank. The results also show that
the hybrid strategy can improve the entropy of language models by 2.02%.
Comment: Accepted by the ERA Track in SANER 202
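The hybrid strategy can be sketched as follows: compound identifiers are first split into semantic sub-words (snake_case and camelCase), and a BPE tokenizer then handles whatever remains. bpe_encode() below is a placeholder for an actual trained BPE tokenizer, and the exact combination used in the paper may differ.

```python
# Sketch of combining identifier splitting with BPE tokenization.
import re

def split_identifier(token: str):
    """Split snake_case and camelCase identifiers into sub-words."""
    words = []
    for part in token.split("_"):
        words.extend(re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", part) or [part])
    return [w for w in words if w]

def bpe_encode(word: str):
    """Placeholder: a real open-vocabulary model would apply learned BPE merges."""
    return [word]

def tokenize(code_tokens):
    pieces = []
    for token in code_tokens:
        for word in split_identifier(token):
            pieces.extend(bpe_encode(word.lower()))
    return pieces

print(tokenize(["getFileName", "max_retry_count"]))
# -> ['get', 'file', 'name', 'max', 'retry', 'count']
```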